Layer Breakdown - MTN Cloud Streaming Platform (MCSP)

Edge Layer

The edge layer is the exclusive entry point for all client traffic. It provides two distinct paths: CDN-mediated media segment delivery and API Gateway-mediated application request routing.

Attribute	Detail
Responsibilities	DDoS absorption, WAF policy enforcement, TLS termination, CDN caching of media segments and manifests, geographic routing, API rate limiting, JWT pre-validation
Core Services	WAF (AWS Shield Advanced / Cloudflare Enterprise), CDN (CloudFront / Akamai with MTN PoP integration), API Gateway (Kong or AWS API Gateway with custom authoriser)
Scaling Model	CDN scales elastically per edge node. API Gateway horizontally scales behind a load balancer. WAF is managed/serverless.
Failure Domains	CDN edge node failure routes to the next nearest PoP. API Gateway node failure is handled by load balancer health checks. The WAF fails open: if it becomes unavailable, traffic is allowed through and an alert fires immediately, because availability is prioritised over WAF enforcement during an outage.
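
The fail-open behaviour described for the WAF can be sketched as follows. This is a minimal illustration, not the platform's implementation: `admit`, `waf_check`, and `alert` are hypothetical names, and in practice this policy would normally be configured in the WAF or CDN product rather than written in application code.

```python
from typing import Callable


def admit(
    request: dict,
    waf_check: Callable[[dict], bool],
    alert: Callable[[str], None],
) -> bool:
    """Return True if the request may proceed to the CDN / API Gateway.

    waf_check returns True to allow and False to block. If the check itself
    fails (the WAF is unreachable or errors), the request is allowed and an
    alert is raised: availability wins over enforcement.
    """
    try:
        return waf_check(request)
    except Exception as exc:  # WAF outage, not a policy decision
        alert(f"WAF unavailable, failing open: {exc}")
        return True
```

A healthy WAF that blocks a request still returns False; only an outage of the check itself lets traffic through.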

Application Layer

All business logic services run as independently deployable microservices. Services are stateless — all session state is held in Redis, not in process memory.

Attribute	Detail
Core Services	User & Auth Service, Upload Service, Content Service, Engagement Service, Playback Service, Subscription Service, Creator Dashboard, Notification Service, Admin Control Plane
Scaling Model	Kubernetes HPA on CPU (target 60%) and custom metrics (request queue depth). Each service scales independently. The Engagement Service uses Redis-buffered write batching to handle viral content write amplification.
Failure Domains	Individual service failure is circuit-broken. Playback Service maintains a 3-replica minimum with a dedicated node pool. Auth Service failure degrades to cached session validation for up to 5 minutes. Engagement Service failures queue writes client-side for retry; view counts tolerate brief outages via eventual consistency.
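
As a rough guide to how the HPA settings above turn into replica counts, the sketch below applies the scaling formula from the Kubernetes documentation, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), and takes the largest result across metrics. The function name, the queue-depth target of 10, and the default replica bounds are assumptions for illustration. The real controller also applies a tolerance band and stabilisation windows, which are omitted here.

```python
import math


def desired_replicas(current, metrics, min_replicas=1, max_replicas=50):
    """Estimate the HPA's desired replica count.

    metrics is a list of (current_value, target_value) pairs, for example
    (CPU utilisation %, 60) and (request queue depth, assumed target).
    The largest proposal across metrics wins, then bounds are applied.
    """
    proposals = [math.ceil(current * value / target) for value, target in metrics]
    return max(min_replicas, min(max_replicas, max(proposals)))


# 4 replicas at 90% CPU (target 60%) and queue depth 30 (assumed target 10)
print(desired_replicas(4, [(90, 60), (30, 10)]))  # 12

# A lightly loaded Playback Service never drops below its 3-replica minimum
print(desired_replicas(3, [(20, 60)], min_replicas=3))  # 3
```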

Zero-Trust boundary. Every inter-service call on MCSP requires a valid mTLS client certificate issued per service identity. No internal endpoint is reachable without mutual authentication — the service mesh (Istio) enforces this independently of application code.

Media Processing Layer

The media processing layer is a fully asynchronous pipeline triggered by upload completion events. Video and audio jobs share the same Kafka-backed job queue and worker pool — content-type metadata in the job descriptor determines the processing branch applied at the transcoding stage.

Attribute	Detail
Responsibilities	Format validation, virus scanning, AI copyright fingerprinting, multi-resolution video transcoding, multi-bitrate audio transcoding, HLS/DASH packaging, DRM encryption, thumbnail and cover art generation, metadata indexing
Core Services	Upload Ingestor, Copyright Scanner (perceptual hash + audio fingerprint), Transcoding Cluster (FFmpeg — GPU for 4K video, CPU-only for audio), DRM Packager (Shaka), Art/Thumbnail Generator, Metadata Indexer
Scaling Model	Audio and video workloads use separate autoscaling profiles. Spot/preemptible instances are used for transcoding (60–70% cost reduction). At least 2 workers are always running to avoid cold-start latency.
Failure Domains	Failed jobs retry with exponential backoff (max 5 attempts) before moving to a dead-letter queue with creator notification. On partial failure (e.g., the 4K transcode fails while 1080p succeeds), the variants that completed are published immediately; a failed higher resolution does not block lower resolutions.
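
The retry policy above can be sketched as a small helper. Only the exponential backoff, the 5-attempt limit, and the dead-letter hand-off come from the description; the 1-second base delay, the function names, and the injected `sleep` parameter are assumptions made so the example is self-contained and testable.

```python
import time
from typing import Callable

MAX_ATTEMPTS = 5  # from the failure-domain description


def run_with_retry(
    job: dict,
    process: Callable[[dict], None],
    dead_letter: Callable[[dict], None],
    base_delay: float = 1.0,  # assumed; the source gives no base delay
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run process(job), retrying with exponential backoff.

    Waits base_delay, 2x, 4x, 8x between the five attempts. If the last
    attempt also fails, the job is handed to dead_letter (which is where
    creator notification would be triggered) and False is returned.
    """
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            process(job)
            return True
        except Exception:
            if attempt < MAX_ATTEMPTS:
                sleep(base_delay * 2 ** (attempt - 1))
    dead_letter(job)
    return False
```

Production workers usually add random jitter to each delay so that many failed jobs do not retry at the same moment; that is left out here for clarity.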

AI / ML Layer

The ML layer operates on two timescales: offline batch training (daily/weekly) and online real-time inference (sub-100 ms per request).

Attribute	Detail
Responsibilities	Behavioural event collection, feature engineering, offline model training, online recommendation serving, AI content moderation
Core Services	Event Collector (Kafka consumer), Feature Store (Feast / Tecton), Offline Trainer (Spark on Kubernetes + Ray), Model Server (Triton / TorchServe), AI Moderation Pipeline
Scaling Model	Model server scales horizontally behind a load balancer. GPU nodes dedicated to inference; CPU nodes for feature serving. Training cluster scales on-demand for scheduled jobs.
Failure Domains	Recommendation inference failure falls back to the trending content feed. AI moderation failure queues content for human review — content is never auto-approved during an outage. Feature store unavailability degrades to cached features.
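
The two degraded modes above behave in opposite ways: recommendations fall back to something safe to show, while moderation refuses to approve anything on its own. A minimal sketch, assuming hypothetical callables `infer`, `trending`, and `classify` and illustrative status strings that do not appear in the source:

```python
from typing import Callable, List


def recommend(
    user_id: str,
    infer: Callable[[str], List[str]],
    trending: Callable[[], List[str]],
) -> List[str]:
    """Serve personalised results, or the trending feed if inference fails."""
    try:
        return infer(user_id)
    except Exception:
        return trending()


def moderate(
    content_id: str,
    classify: Callable[[str], bool],
    review_queue: List[str],
) -> str:
    """Return a moderation status; never approve when the classifier is down."""
    try:
        return "approved" if classify(content_id) else "rejected"
    except Exception:
        # Outage: queue for a human instead of guessing.
        review_queue.append(content_id)
        return "pending_human_review"
```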

Data Layer

Each data class uses a purpose-fit store. No store is shared across unrelated data domains.

Store	Purpose	Failure Mode
PostgreSQL (multi-AZ, read replicas)	Users, content metadata, subscriptions, transactions	Primary failure triggers automated standby failover (RTO < 30 seconds)
Object Storage (S3-compatible)	Hot, cold, and residency-isolated media file buckets	Eleven-nines (99.999999999%) durability; cross-AZ replication. Residency buckets never replicate cross-region.
Elasticsearch	Full-text content search and discovery	Failure degrades search — not on the playback critical path
TimescaleDB / ClickHouse	Analytics time-series, creator metrics	Degraded analytics; no impact on streaming
Redis Cluster	Sessions, idempotency keys, engagement counters, ML inference cache	Failure degrades performance but not correctness — sessions fall back to DB validation

Control Plane

The control plane is operationally isolated from the viewer-facing data plane with its own deployment, network boundaries, and scaling policies.

Attribute	Detail
Core Services	Admin Control Plane API, Moderation Dashboard, Residency Policy Engine, Ad Operations Console, Audit Log Service
Scaling Model	Scaled conservatively — handles significantly lower RPS than the data plane. Admin operations are rate-limited to prevent bulk operational errors.
Failure Domains	Control plane failure does not impact viewer streaming. Moderation pipeline failure routes all flagged content to a holding queue — content is not auto-approved during an outage.
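
The source does not say how admin operations are rate-limited. One common choice is a token bucket, sketched below as an assumption rather than a description of the actual control plane. The class name and parameters are hypothetical, and the clock is passed in explicitly so the behaviour is easy to reason about.

```python
class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_sec` tokens/second."""

    def __init__(self, rate_per_sec: float, capacity: int, now: float = 0.0):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)  # start full
        self.last = now

    def allow(self, now: float) -> bool:
        """Consume one token if available; `now` is a timestamp in seconds."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

With a small capacity, a script that fires hundreds of bulk admin actions is throttled after the first few, which limits the blast radius of an operational mistake.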

Observability Layer

Observability is a deployment gate. Services that do not emit structured logs, RED metrics (Rate, Errors, Duration), and distributed traces will fail the CI/CD pipeline health check and cannot be deployed to production.
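
A deployment gate of this kind could be expressed as a check over a per-service telemetry report. The report shape (`structured_logs`, `metrics`, `traces`) and the function name are hypothetical; the source only states which three signals are required.

```python
REQUIRED_RED_METRICS = {"rate", "errors", "duration"}


def gate_failures(report: dict) -> list:
    """Return the reasons a service fails the observability gate.

    An empty list means the service may be deployed to production.
    """
    failures = []
    if not report.get("structured_logs", False):
        failures.append("structured logs not emitted")
    missing = REQUIRED_RED_METRICS - set(report.get("metrics", []))
    if missing:
        failures.append("missing RED metrics: " + ", ".join(sorted(missing)))
    if not report.get("traces", False):
        failures.append("distributed traces not emitted")
    return failures
```

A CI step would fail the pipeline whenever the returned list is non-empty and print the reasons for the service owner.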

Component	Role
Loki / ELK Stack	Centralised structured log aggregation and search
Prometheus + Grafana	System and application metrics; SLO dashboards
Jaeger / Tempo	Distributed request tracing (1–5% sampled on high-volume paths)
PagerDuty	Alerting and on-call routing
Append-only Audit Store	DynamoDB (no-delete policy) or a Kafka compacted topic with a unique key per record (compaction keeps only the latest record for each key), providing a compliance-grade immutable record

Metrics are retained at high resolution for 7 days, then downsampled and kept for 1 year. Audit log records are partitioned and tiered to cold storage after 90 days but are never deleted.
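
The retention rules can be written as two small lookups. The tier names, the inclusive boundaries, treating 1 year as 365 days, and dropping metrics after that year are assumptions; the source states only the 7-day, 1-year, and 90-day thresholds and that audit records are never deleted.

```python
def metric_resolution(age_days: float) -> str:
    """Storage tier for a metric sample of the given age."""
    if age_days <= 7:
        return "high_resolution"
    if age_days <= 365:
        return "downsampled"
    return "expired"  # assumed: metrics are dropped after 1 year


def audit_tier(age_days: float) -> str:
    """Storage tier for an audit record; there is deliberately no delete tier."""
    return "hot" if age_days <= 90 else "cold"
```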
